
    Generation of Policy-Level Explanations for Reinforcement Learning

    Though reinforcement learning has greatly benefited from the incorporation of neural networks, the inability to verify the correctness of such systems limits their use. Current work in explainable deep learning focuses on explaining only a single decision in terms of input features, making it unsuitable for explaining a sequence of decisions. To address this need, we introduce Abstracted Policy Graphs, which are Markov chains of abstract states. This representation concisely summarizes a policy so that individual decisions can be explained in the context of expected future transitions. Additionally, we propose a method to generate these Abstracted Policy Graphs for deterministic policies given a learned value function and a set of observed transitions, potentially off-policy transitions used during training. Since no restrictions are placed on how the value function is generated, our method is compatible with many existing reinforcement learning methods. We prove that the worst-case time complexity of our method is quadratic in the number of features and linear in the number of provided transitions, O(|F|^2 · |tr_samples|). By applying our method to a family of domains, we show that it scales well in practice and produces Abstracted Policy Graphs which reliably capture relationships within these domains. Comment: Accepted to the Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (2019).
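
    The following Python sketch illustrates the general idea under loose assumptions; it is not the paper's algorithm. States are grouped into abstract states by binning their value estimates (a simple stand-in for the paper's feature-based abstraction), and a Markov chain over those abstract states is estimated from the observed transitions. All names (build_apg, value_fn, policy) are hypothetical.

    # Hypothetical sketch: build an abstracted-policy-graph-like Markov chain
    # from a learned value function, a deterministic policy, and observed
    # (possibly off-policy) transitions. Abstraction here is simple value
    # binning, which only approximates the spirit of the paper's abstraction.
    from collections import defaultdict

    import numpy as np

    def build_apg(transitions, value_fn, policy, n_bins=10):
        """transitions: list of (state, next_state) pairs.
        value_fn: state -> scalar value estimate.
        policy:   state -> deterministic action.
        Returns (abstract, chain, action_of): abstract(state) gives an abstract-state
        id, chain[a][b] estimates P(b | a), and action_of[a] is the most common
        action the policy takes inside abstract state a."""
        values = np.array([value_fn(s) for s, _ in transitions])
        edges = np.linspace(values.min(), values.max(), n_bins + 1)

        def abstract(state):
            return int(np.clip(np.digitize(value_fn(state), edges) - 1, 0, n_bins - 1))

        counts = defaultdict(lambda: defaultdict(int))
        actions = defaultdict(list)
        for s, s_next in transitions:
            a, b = abstract(s), abstract(s_next)
            counts[a][b] += 1
            actions[a].append(policy(s))

        chain = {a: {b: c / sum(row.values()) for b, c in row.items()}
                 for a, row in counts.items()}
        action_of = {a: max(set(acts), key=acts.count) for a, acts in actions.items()}
        return abstract, chain, action_of

    Labeling each node of the resulting chain with its representative action is what allows an individual decision to be read in the context of the expected future transitions.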

    Unifying State and Policy-Level Explanations for Reinforcement Learning

    Reinforcement learning (RL) is able to solve domains without needing to learn a model of the domain dynamics. When coupled with a neural network as a function approximator, RL systems can solve complex problems. However, verifying and predicting RL agent behavior is made difficult by these same properties; a learned policy conveys “what” to do, but not “why.” This thesis focuses on producing explanations for deep RL: summaries of behavior and their causes that can be used for downstream analysis. Specifically, we focus on the setting where the final policy is obtained from a limited, known set of interactions with the environment. We categorize existing explanation methods along two axes: (1) whether a method explains single-action behavior or policy-level behavior, and (2) whether a method provides explanations in terms of state features or past experiences. Under this classification, there are four types of explanation methods, and they enable answering different questions about an agent. We introduce methods for creating explanations of these types. Furthermore, we introduce a unified explanation structure that is a combination of all four types. This structure enables obtaining further information about what an agent has learned and why it behaves as it does. First, we introduce CUSTARD, our method for explaining single-action behavior in terms of state features. CUSTARD’s explanation is a decision tree representation of the policy. Unlike existing methods for producing such a decision tree, CUSTARD directly learns the tree without approximating a policy after training and is compatible with existing RL techniques. We then introduce APG-Gen, our approach for creating a policy-level behavior explanation in terms of state features. APG-Gen produces a Markov chain over abstract states that enables predicting future actions and aspects of future states. APG-Gen only queries an agent’s Q-values, making no assumptions about an agent’s decision-making process. We integrate these two methods to produce a Unified Explanation Tree (UET). A UET is a tree that maps from a state directly to both an action and an abstract state, thus unifying single-action and policy-level behavior explanations in terms of state features. We extend existing work on finding important training points in deep neural networks. Our method, MRPS, produces explanations of single-action behavior in terms of past experiences. MRPS can find importance values for sets of points and accounts for feature magnitudes to produce more meaningful importance values. Finally, we find the importance values of sets of past experiences for any node within a UET. Additionally, we introduce methods for computing approximate and exact influence for UET nodes. Since a UET conveys both single-action and policy-level behavior, these importance and influence values explain both levels of behavior in terms of past experiences. Our overall solution enables identifying the portion of the UET that would change if specific experiences were removed from or added to the set used by the agent.
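
    As a rough illustration of the unified structure described above (not the thesis's construction or API), the sketch below shows a decision tree whose internal nodes test state features and whose leaves carry both an action and an abstract-state id, so a single lookup yields both the single-action and the policy-level explanation. MRPS importance and influence values are out of scope here; all names are hypothetical.

    # Hypothetical sketch of a UET-like structure: feature tests at internal
    # nodes, (action, abstract state) pairs at the leaves.
    from dataclasses import dataclass
    from typing import Optional, Sequence, Tuple

    @dataclass
    class UETNode:
        # Internal node: tests state[feature] <= threshold; for a leaf, feature is None.
        feature: Optional[int] = None
        threshold: float = 0.0
        left: Optional["UETNode"] = None
        right: Optional["UETNode"] = None
        action: Optional[int] = None          # leaf only: action the policy takes here
        abstract_state: Optional[int] = None  # leaf only: node of the policy-level Markov chain

    def explain(node: UETNode, state: Sequence[float]) -> Tuple[int, int]:
        """Walk the tree and return (action, abstract_state) for a concrete state."""
        while node.feature is not None:
            node = node.left if state[node.feature] <= node.threshold else node.right
        return node.action, node.abstract_state

    # Example: split on feature 1 at 0.5; each leaf names an action and an abstract state.
    tree = UETNode(feature=1, threshold=0.5,
                   left=UETNode(action=0, abstract_state=2),
                   right=UETNode(action=1, abstract_state=5))
    print(explain(tree, [0.3, 0.9]))  # -> (1, 5)

    Attaching experience-importance values to individual nodes of such a tree is what lets the same structure also be explained in terms of past experiences.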

    Iterative Bounding MDPs: Learning Interpretable Policies via Non-Interpretable Methods

    Current work in explainable reinforcement learning generally produces policies in the form of a decision tree over the state space. Such policies can be used for formal safety verification, agent behavior prediction, and manual inspection of important features. However, existing approaches fit a decision tree after training or use a custom learning procedure which is not compatible with new learning techniques, such as those which use neural networks. To address this limitation, we propose a novel Markov Decision Process (MDP) type for learning decision tree policies: Iterative Bounding MDPs (IBMDPs). An IBMDP is constructed around a base MDP so that each IBMDP policy is guaranteed to correspond to a decision tree policy for the base MDP when using a method-agnostic masking procedure. Because of this decision tree equivalence, any function approximator can be used during training, including a neural network, while still yielding a decision tree policy for the base MDP. We present the required masking procedure as well as a modified value update step which allows IBMDPs to be solved using existing algorithms. We apply this procedure to produce IBMDP variants of recent reinforcement learning methods. We empirically show the benefits of our approach by solving IBMDPs to produce decision tree policies for the base MDPs.
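
    To make the construction concrete, here is a loose Python sketch of the wrapper idea, not the paper's exact formulation: it assumes a gym-style base environment whose step returns (obs, reward, done, info), features normalized to [0, 1], and an arbitrary per-query penalty. The agent observes only per-feature bounds (the masking), and each action either halves one feature's interval or applies a base action to the real environment; names such as IterativeBoundingEnv and query_penalty are hypothetical.

    # Hypothetical sketch of an IBMDP-style wrapper around a gym-like base
    # environment. Reward shaping and the modified value update from the
    # paper are omitted; this only shows the bound-and-act interaction loop.
    import numpy as np

    class IterativeBoundingEnv:
        def __init__(self, base_env, n_features, n_base_actions, query_penalty=-0.01):
            self.base_env = base_env
            self.n_features = n_features
            self.n_base_actions = n_base_actions
            self.query_penalty = query_penalty  # assumed small cost per bounding query

        def reset(self):
            self._obs = np.asarray(self.base_env.reset(), dtype=float)
            self._reset_bounds()
            return self._bounds_obs()

        def _reset_bounds(self):
            self.lower = np.zeros(self.n_features)
            self.upper = np.ones(self.n_features)

        def _bounds_obs(self):
            # Masking: the agent only ever observes the current feature bounds,
            # never the underlying base-state features directly.
            return np.concatenate([self.lower, self.upper])

        def step(self, action):
            if action < self.n_features:
                # Information-gathering action: halve the chosen feature's interval
                # on the side that contains the true (hidden) feature value.
                f = action
                mid = 0.5 * (self.lower[f] + self.upper[f])
                if self._obs[f] <= mid:
                    self.upper[f] = mid
                else:
                    self.lower[f] = mid
                return self._bounds_obs(), self.query_penalty, False, {}
            # Base action: act in the real environment, then start a fresh
            # round of bounding for the next base state.
            self._obs, reward, done, info = self.base_env.step(action - self.n_features)
            self._obs = np.asarray(self._obs, dtype=float)
            self._reset_bounds()
            return self._bounds_obs(), reward, done, info

    Because the policy conditions only on the bounds, any sequence of splits followed by a base action reads off as a root-to-leaf path of a decision tree over the base features, which is the equivalence the abstract describes.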